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Abstract. Laplacian matrices of graphs arise in large-scale computational applications such 
as semi-supervised machine learning; spectral clustering of images, genetic data and web pages; 
transportation network flows; electrical resistor circuits; and elliptic partial differential equations 
discretized on unstructured grids with finite elements. A Lean Algebraic Multigrid (LAMG) solver 
of the symmetric linear system Ax = b is presented, where A is a graph Laplacian. LAMG's run 
time and storage are empirically demonstrated to scale linearly with the number of edges. 

LAMG consists of a setup phase during which a sequence of increasingly-coarser Laplacian sys- 
tems is constructed, and an iterative solve phase using multigrid cycles. General graphs pose al- 
gorithmic challenges not encountered in traditional multigrid applications. LAMG combines a lean 
piecewise-constant interpolation, judicious node aggregation based on a new node proximity measure 
(the affinity), and an energy correction of coarse-level systems. This results in fast convergence and 
substantial setup and memory savings. A serial LAMG implementation scaled linearly for a diverse 
set of 3774 real- world graphs with up to 47 million edges, with no parameter tuning. LAMG was 
more robust than the UMFPACK direct solver and Combinatorial Multigrid (CMG), although CMG 
was faster than LAMG on average. Our methodology is extensible to eigenproblems and other graph 
computations. 

Key words. Linear-scaling numerical linear solvers, graph Laplacian, aggregation-based alge- 
braic multigrid, piecewise-constant interpolation operator, high-performance computing. 
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1. Introduction. Let G = (TV, £ , w) be a connected weighted undirected graph, 
where J\f is a set of n nodes, £ is a set of m edges, and w : £ — > K + is a weight function. 
The Laplacian matrix A nx „ is naturally defined by the quadratic energy 

£(x):=x T Ax= w uv {x u -x v )\ xel^, (1.1) 

(u,o)e£ 

where x T denotes the transpose of x. In matrix form, 

-W uv , v G £u ■= W ■ (u, v') G 8} , (1.2) 

, otherwise. 

A is Symmetric Positive Semi-definite (SPS), and has zero row sums and 2m + n 
non-zeros. Typically, m -C n 2 and A is sparse. Our approach also handles some SPS 
Laplacian matrices corresponding to negative edge weights, such as high-order and 
anisotropic grid discretizations [12] |6j ; those are discussed in £15.21 

Since G is connected, A's null space is spanned by the vector of ones 1 (a dis- 
connected graph can be decomposed into its components in 0(m) time [58, 63 ). We 
consider the nonsingular compatible linear system [65, pp. 185-186] 

Ax = b (1.3a) 
l T x = 0, (1.3b) 
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Fig. 1.1. A 5-node graph and its corresponding Laplacian matrix. 



where b S M is a given zero-sum vector, and x e Mr is the vector of unknowns. 

Our goal is to develop an iterative numerical solver of (|1.3|) that requires 0(m) 
storage and 0(r7ilog(l/e)) operations to generate an e-accurate solution, for graphs 
arising in real-world applications. The hidden constants should be small (in the 
hundreds, not millions). The solver should require a smaller cost to re-solve the 
system for multiple b's - a useful feature for time-dependent and other applications. 

Importantly, we are interested in good empirical performance (bounded hidden 
constants over a diverse set of test instances), and do not consider the problem of 
designing an algorithm with provably linear complexity in any graph |61l Problem 
5, p. 18]. While proofs are important, they often provide unrealistic bounds or no 
bounds at all on the hidden constants that can be attained in practice. 

1.1. Applications. The linear system (|1.3j) is fundamental to many applica- 
tions; see Spielman's review [5T] §2] for more details: 

• Elliptic Partial Differential Equations (PDEs) discretized on unstructured 
grids by finite elements within a fluid dynamics simulation [7] [31] . 

• Interior-point methods for network flow linear programming [23, 32 . 

• Electrical flow through a resistor network G. 

Additionally, Ax = b is a stepping stone toward the eigenproblem. Our multilevel 
methodology can be extended to compute the smallest eigenpairs of A with minor 
adaptations (cf. £|6.4|) . The Laplacian eigenproblem is central to graph regression 
and classification in machine learning }20[ 167] . spectral clustering of images, graph 
embedding [59j , and dimension reduction for genetic ancestry discovery [45] . Of par- 
ticular interest is the Fiedler value - the smallest non-zero eigenvalue of A, which 
measures the algebraic connectivity of G [211 §1.1] and is related to minimum cuts 
[27] . Although we believe it is preferable to develop multiscale strategies for the origi- 
nal formulations of these problems, as demonstrated by the works [36] ESI [55] [60] and 
graph partitioning packages [37 , a fast black-box eigensolver is a practical alternative. 

1.2. Related Work. There are two main approaches to solving (|1.3[) : direct, 
leading to an exact solution (up to round-off errors); and iterative, which typically 
requires a one-time setup cost, followed by a solve phase that produces successive 
approximations x to x to achieve e-accuracy, namely, 

||x-x|| A <e||x|| A , ||x|| A := v^W- (1-4) 

1.2.1. Direct Methods. The Cholesky factorization with a clever elimination 
order can be applied to A. A permutation matrix P is chosen and factorization 
P T AP = LL T constructed so that the lower triangular L is as sparse as possible, 
using Minimum or Approximate Minimum Degree Ordering [TJ I64j . Except for simple 
graphs, direct algorithms do not scale, requiring 0(n 15 ) operations for planar graphs 
[53] 3B] and 0(n 3 ) in general. Alternatively, fast matrix inversion can be performed 
in 0(n 2 376 ) or combined with Cholesky, yet yields similar complexities 61, §3.1]. 
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1.2.2. Iterative Methods: Graph Theoretic. These are variants of the pre- 
conditioned conjugate method [35] §10.3] that achieve (jl.4l) in 0{\J ^(AB" 1 ) log(l/e)) 
iterations for a preconditioner B; n is the finite condition number [61 , §3.3]. 

Spielman and Teng (S-T) [62] and subsequent works [41] have been focusing 
on multilevel graph-sparsifying preconditioners. The S-T setup builds increasingly 
smaller graphs, alternating between partial Cholesky and ultra-sparsification steps, in 
which the graph is partitioned into sets of high-conductance nodes without removing 
too many edges. The complexity is a near-linear 0(mlog 2 nlog(l/e)), guaranteed for 
any symmetric diagonally-dominant A. Unfortunately, no implementation is available 
yet, nor is there a guarantee on the size of the hidden constant. 

1.2.3. Iterative Methods: AMG. Algebraic Multigrid (AMG) is a class of 
high-performance linear solvers, originating in the early 1980s [21 §1.1], [15], [56] and 
under active development. During setup, AMG recursively constructs a multi-level 
hierarchy of increasingly coarser graphs by examining matrix entries, without relying 
on geometric information. The solve phase consists of multigrid cycles. AMG can 
be employed either as a solver or a preconditioner [651 App. A]. Open-source parallel 
implementations include Hypre [3D| and Trilinos-PETSc [35]. In classical AMG, the 
coarse set is a subset of Af; alternatively aggregation AMG [55] App. A. 9], [81 l4l 15*2] 
and smoothed aggregation [TO] define coarse nodes as aggregates of fine nodes. 

AMG mainly targets discretized PDEs, where it has been successful [31] ■ Ad- 
vanced techniques have ventured to widen its scope by increasing the interpolation 
accuracy. These include Bootstrap AMG [11] §17.2], [13] adaptive smoothed ag- 
gregation [17] . and interpolation energy minimization [54] . While these methods 
approach linear scaling for more systems, their complexity cannot be controlled in 
general graphs (cf. §3.1.2|) . 

At the same time, accelerated aggregation AMG has become a hot research topic. 
A crude caliber- 1 (piecewise-constant) interpolation is employed between levels to 
reduce runtime and memory costs, at the expense of a slower cycle that is subsequently 
accelerated. Notay [52] [53] aggregated nodes based on matrix entries and applied 
multilevel CG acceleration with a large cycle index to obtain a near-optimal solver 
for convection-diffusion M-matrices, but the method was limited to those grid graphs. 
Caliber- 1 interpolation was tested for a single graph by Bolten et. al within the 
bootstrap framework [5 , but their setup cost was large and required parameter tuning. 

The present work aims at generalizing AMG to graph Laplacians and addresses 
peculiarities not encountered in traditional AMG applications. To the best of our 
knowledge, the only other solver targeting general topologies is Combinatorial Multi- 
grid (CMG) [32], a hybrid graph-theoretic- AMG preconditioner that partitions nodes 
into high-conductance aggregates, similarly to S-T. CMG outperformed classical AMG 
for a set of 3-D image segmentation applications. In our experiments over a much 
larger graph collection, CMG and LAMG had comparable average solve speeds, yet 
LAMG's performance was much more robust, with almost no outliers. 

1.3. Our Contribution. We present Lean Algebraic Multigrid (LAMG): a 
practical graph Laplacian linear solver. A Matlab LAMG implementation scaled 
linearly for a set of 3774 real-world graphs with up to 47 million edges, ranging from 
computational fluid dynamics to web, biological and social networks. Specifically, the 
setup phase required on average 200 Matrix- Vector Multiplications (MVMs) and Am 
storage bytes, and the average solve time was « 271og(l/e) MVMs per right-hand 
side. The standard deviations were small with only three outliers. LAMG was more 
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robust than the UMFPACK direct solver and CMG, although CMG was faster on 
average fi )5.ip . Our methodology is extensible beyond the scope of S-T and CMG, to 
non-diagonally-dominant f ij5.2p . eigenvalue, and nonlinear problems ( M6.4|) . 

LAMG is an accelerated caliber- 1 aggregation- AMG algorithm that builds upon 
the state-of-the-art AMG variants, yet introduces four new ideas essential to attaining 
optimal efficiency in general graphs: 

(a) Lean methodology ( §3.1j) . LAMG advocates using minimalistic over sophisti- 
cated AMG components. No parameter tuning should be required. In particu- 
lar, we apply caliber-1 interpolation between levels, constructed using relaxed 
Test Vectors (TVs), but without bootstrapping them as in the papers [51H3]. 
Fast asymptotic convergence is achieved by the following three ideas. 

(b) Low-degree Elimination f H3.2p . Like the S-T method |62j , we eliminate low- 
degree nodes prior to each aggregation. However, the role of elimination here 
is different: it removes the effectively- 1-D part of the graph, thereby eliminat- 
ing extreme tradeoffs between complexity and accuracy in aggregating the 
remaining graph. Thus, unlike S-T, our elimination need not strictly reduce 
the number of edges (allowing us to eliminate nodes of higher degrees) nor 
be exact (making it also useful in the eigenproblem) . The elimination and 
aggregation are in fact specializations of the same coarsening scheme ( tj6.4p . 

(c) Affinity f ^3.3p . Aggregation is based on a new normalized relaxation-based 
node proximity heuristic. Ron et al. |55j were the first to use TVs to measure 
algebraic distance between nodes, but our measure is effective for a wider 
variety of graph structures. The affinity admits a statistical interpretation 
and approximates the diffusion distance 22 . In contrast, the S-T algorithm 
strives to create high-conductance aggregates [521 ED] ■ 

(d) Energy- corrected Aggregation fi )3.4p . Recognizing that caliber-1 interpolation 
leads to an energy inflation of the coarse- level system ( 33.4.3p . our aggregation 
also reuses test vectors to minimize coarse-to-fine energy ratios. We offer two 
alternative energy corrections to accelerate the solution cycle: a flat correction 
to the Galerkin operator resemblant of Bracss' work [5], yet resulting in a 
superior efficiency; and an adaptive correction to the solution via multilevel 
iterate recombination |65| §7.8.2] that is even more efficient. 

Following an AMG prelude in fj2[ we discuss each of the main ideas in Sj3l They 
are integrated into the complete LAMG algorithm in §3] (cf. the expanded ArXiV 
e-print [49] for implementation details). Our development methodology emphasizes 
learning from examples: we studied instances for which the original design was slow 
to derive general aggregation rules. Testing over a large collection ensured that new 
rules did not spoil previous successes. The results are presented in £j5j Coarsening 
improvements and extensions to the eigenproblem and other graph computational 
problems are outlined in Sj5J 

2. Algebraic Multigrid Basics. Relaxation methods for Ax = b such as Ja- 
cobi or Gauss-Seidel slow down asymptotically. Yet after only several sweeps, the 
error e := x — x in the approximation x to x becomes algebraically smooth: its nor- 
malized residuals are much smaller than its magnitude |141 §1.1] (assuming its mean 
has been subtracted out). In AMG, these errors are approximated by an interpolation 
P nx „ c from a coarse subspace: e ks Pe c , where e c is a coarse vector of size n c . 

Recognizing that Ax = b corresponds to the quadratic minimization 



x = min Et t(y) , E tot (x) := -x T Ax - x T b , 
y 2 



(2.1) 
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the variational correction scheme |15) seeks the optimal correction in the energy norm, 

e c = argmin E tot (x + Py c ) . (2.2) 

The resulting two-level cycle (except for determining x's mean) is 

1. Perform 2-3 relaxation sweeps on Ax = b, resulting in x. 

2. Compute an approximation e c to the solution e c of 

A c e c = b c , A c := P T AP , b c P T (b - Ax) . (2.3) 

3. Correct the fine-level solution: 

x^x + Pe c . (2.4) 

Eq. (|2.3|) , called Galerkin coarsening, is a smaller linear system. In the multilevel 
cycle, it is recursively solved using 7 two-level cycles, where the cycle index 7 is an 
input parameter. Our interpolation P is full-rank with unit row sums (see £13.1. 4p 
for which it easy to verify that A c is also a connected graph Laplacian. Overall, 
LAMG constructs a hierarchy of L increasingly coarser Laplacian systems ("levels") 
A'x' = b , I = 1, . . . , L, the finest being the original system A 1 := A, b 1 := b. The 
setup phase depends on A only, and produces {{A 1 , P z )}^7 2 , where A' is ni x rt; 
and P is the x ni interpolation matrix from level I to I — 1. We denote by 
G l = (A/"', £ l ) the graph corresponding to A 1 . 

3. LAMG: Main Ideas. Our description refers to a single coarsening stage 
(A — > A c ) and applies to each pair of levels. 

3.1. Lean Methodology. 

3.1.1. Relaxation. Our choice is Gauss-Seidel (GS) relaxation, defined by the 
successive updates [HI §1.1] 

hoi u = 1, . . . ,n , x u i . (3.1) 

We picked GS because it is an effective smoother in SPS systems 14, §1] and does 
not require parameter tuning (such as the Jacobi relaxation damping parameter [5]). 

3.1.2. Interpolation Caliber. Textbook multigrid convergence for the Poisson 
equation requires that the interpolation of corrections P be second-order [9] §3.3]. The 
analogous AMG theory implies a similar condition on the interpolation accuracy of 
low-energy errors. While a piecewise-constant P is acceptable in a two-level cycle, 
it is insufficient for V-cycles (7 = 1); W-cycles (7 = 2) are faster but costly [551 
p. 471], [5J]. 

Constructing a second-order P is already challenging in grid graphs [T31 HH1 1M] • 
We argue that it is infeasible in general graphs: 

(a) The graph's effective dimension d is unknown; had it been known, the required 
interpolation caliber, i.e., the number of coarse nodes used to interpolate a fine 
node, would grow with d and result in unbounded interpolation complexity. 

(b) Identifying a proper interpolation set (whose "convex hull" contains the fine 
node) is a complex and costly process [13] , 

(c) The Galerkin coarse operator P T AP fills in considerably; cf. ^3.1.31 
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In contrast, LAMG employs a lean caliber-1 (piecewise-constant) P, equivalent to 
an aggregation of the nodes into coarse-level aggregates, and corrects the energy of 
the coarse-level Galerkin operator to maintain good convergence (in practice, the 
correction is applied to the coarse right-hand side). This could not have been achieved 
within the variational setting of which only permits modifying P. Here the barrier 
to fast convergence is the coarse-to-fine operator energy ratio. Our contribution is an 
algorithm that yields a small energy ratio, which translates into optimal efficiency; 
cf. t j3.41 The compatible relaxation performance predictor [TDJ HZl US] > [HI §§14.2— 
14.3] is irrelevant for low interpolation accuracy; the energy ratio is a better predictor. 

3.1.3. Managing Fill-in. Frequently, the coarse-level matrices in AMG hierar- 
chies become increasingly dense. This is a result of a poor aggregation, a high-caliber 
P, or both: many fine nodes whose neighbor sets are disjoint are aggregated, creat- 
ing additional edges among coarse-level aggregates. This renders the ideally-accurate 
interpolation irrelevant, because the actual cycle efficiency (error reduction per unit 
work) is small albeit convergence may be rapid. While fill-in is often controllable in 
grid graphs because their coarsening is still local, it is detrimental in non-local graphs. 

LAMG's interpolation is designed to minimize fill-in. Heuristically, the sparser 
P, the sparser P T AP. We further indirectly control fill-in via our affinity criterion 
f H3.3p . which tends to aggregate nodes that share many neighbors. The cycle work is 
also restrained by a fractional cycle index jT4] §6.2] between 1 and 2; cf. ^4.114.21 

Occasionally, the interpolation caliber may be slightly increased as long as the 
number of coarse edges does not become too large; see M6.ll 

3.1.4. Utilizing Extenuating Circumstances. Specific properties of graph 
Laplacians are exploited to simplify the LAMG construction. 

• Since A has zero row sums, its null-space consists of constant vectors. A 
fundamental AMG assumption is that all near-null-space errors can be fitted 
by a single interpolation from a coarse level [14| p. 8]. Here it implies that 
the interpolation weights can be apriori set to 1. The unit-weight assumption 
is easily verified for Laplacians with bounded node degrees [551 p. 439]; in 
the most interesting applications of this work, however, the node degree is 
unbounded, and a proof of this conjecture is an open problem. 

• Some graph locales are effectively one-dimensional: many nodes have degree 
1-2. Such nodes can be quickly eliminated similarly to the paper [41] ( §3.2p . 

• In other graphs, Gauss-Seidel is an efficient solver and no coarsening is re- 
quired. These include complete graphs, star graphs, expander graphs [611 §1], 
and certain classes of random graphs. More generally, if GS converges fast 
for a subset of the nodes, they can all be aggregated together f i|3.4.2p . 

3.2. Low-degree Elimination. We first attempt to eliminate from A/" an inde- 
pendent set J- of nodes u of degree \£ u \ < 4. The set is identified by initially marking 
all nodes as "eligible"; we then sweep through nodes, adding each eligible low-degree 
node to T and marking its neighbors as ineligible [49,, Algorithm 1]. 

Eliminating a node connects all its neighbors; therefore, J-"-nodes of degree < 3 
do not increase m. When \E U \ = 4, m might be increased by at most 2. However, we 
assume that this is unlikely to happen for many .F-nodes and eliminate these nodes 
as well (in practice, we have observed that the neighbors are already connected prior 
to elimination). Eliminating larger degrees results in an impractical fill-in. 
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Let C := M\J-. Ax = b reduces to the Schur complement system 

A c x c = b c , A C :=P T AP, b c :=P T b, P := II (-A£ C A^, , I C ) T , (3.2) 

where II is a permutation matrix such that II T x lists the all J- node values, then 
all C node values. (|3.2[) is a smaller Laplacian system for which we perform further 
elimination rounds, until \T\ becomes small [JHJ Algorithm 2]. 

The purpose of elimination is to reduce n while incurring a small fill-in, and 
to remove the 1-D part of the graph, which cannot be effectively coarsened by the 
energy-corrected aggregation of W3A\ Note that P's caliber is larger than 1 here. 

Viewed as a full approximation scheme f q6.4j) . p. 21) could be generalized to a non- 
exact elimination for approximating the lowest eigenvectors of A, instead of forming 
a nonlinear Schur complement [3]. Additionally, a larger set of loosely- coupled nodes 
could also be eliminated: if A.jrjr is strongly diagonally-dominant, its inverse can be 
approximated by a few Jacobi relaxations. The two-level convergence rate and fill-in 
would need to be kept in check in that case. We plan to pursue these generalizations 
in a future research. 

We hereafter denote the coarse system by Ax = b (either (|1.3a|) . if q = 0, or 
(|3.2[l ). which is further coarsened by caliber- 1 aggregation in §333HS31 

3.3. Affinity. The construction of an effective aggregate set hinges upon defining 
which nodes in Af are "proximal", i.e., nodes whose values are strongly coupled in all 
smooth (i.e., low-energy) vectors [65l p. 473]. Table [3~T1 lists three definitions. 



Classical AMG [56 


1 - \w uv 


1 max{max s \w us ,max s k; s „|} 


Algebraic Distance [55] 


maxf =1 


r (*0 _(fc) 




Affinity (LAMG) 


Cuv - — 1 ( • X x ; 

(X U ,X V ) :— J2k=i x 


\"/((X u ,X u )(X v ,X v )) , 

u 



Table 3.1 

Comparison of node proximity measures. Nodes are defined as "close" when the measure is 
smaller than a threshold. {x'*5}jF - is a set of relaxed test vectors; see the text. 




3.3.1. Existing Proximity Measures. Classical AMG defines proximity based 
on edge weights (Table 13.11 top row). While this has worked well for coarsening 
discretized scalar elliptic PDEs, it leads to wrong aggregation decisions in non-local 



8 



O. E. LIVNE AND A. BRANDT 



graphs. In a grid graph with an extra link between distant nodes u, v (Fig. 13.1b ). u 
and v become proximal and may be aggregated. Unless w uv is outstandingly large, 
this is undesirable because u and v belong to unrelated milieus of the grid. 

This problem is overcome by the algebraic distance measure introduced by Ron et 
al. [55] (Table l3Tl middle row; a related definition is used in the work [12]). Insofar 
as coarsening concerns the space of smooth error vectors x, nodes u and v should be 
aggregated only if x u and x v are highly correlated for all such x. A set of K Test 
Vectors (TVs) x^, . . . ,-xS K > is generated - a sample of this error space [TT] §17.2], 
[55] . Each TV is the result of applying v relaxation sweeps to Ax = 0, starting from 
randomf— 1, 1]. However, Ron et al.'s definition falls prey to a graph containing two 
connected high-degree nodes u and v ("hubs"; Fig. 13.1b ). For each k, the value xffi 
is an average over a large neighborhood of random node values whose size increases 
with the number of sweeps, hence is small. Similarly, x v k ^ is small, so every such u 
and v turns out proximal, even though they may be distant. 

3.3.2. The New Proximity Measure. LAMG's proximity measure, the affin- 
ity ( Table [3TTI bottom), also relies on TVs, but is scale- invariant, and correctly assesses 
both Fig. 13.1b and b as well as many other constellations. The affinity c uv between u 
and v is defined as the goodness of fitting the linear model x v ~ px u to TV values: 

c — i \i. x u,X v )\ (if) -V/ 1 /' X -=(xW x {K] 

uv (X U ,X U )(X V ,X V )> k^i ' ^ '" 

(3-3) 

Cuu = 0, < c uv < 1 and c uv — c vu . The affinity measures distance: the smaller c uv , 
the closer u and v. In the d-D discretized Laplace operator on a grid, c uv related to 
the geometric distance between the gridpoints corresponding to u and v. In general 
graphs, c uv is an alternative definition of the algebraic distance [55 , and approximates 
the diffusion distance at a short time v |22) . 

3.3.3. Statistical Interpretation, be thought of as a random variable; 
c uv is the Fraction of Variance Unexplained of linearly regressing x u on x v using the 
TV samples X u and X v [28]. We make several observations: 

• c uv is invariant to scaling X u and X v . This is vital in the two- hub case 

(Kg. sat). _ 

• Rather than subtracting the sample means X u := Ylk=i x ^ an d Xv from 
X u and X v , respectively, as in the standard statistical definition, we use their 
exact means over all error vectors. These are zero, since each X u is a linear 
combination of some initial X^s, each of which has a zero mean. 

• A weighted inner product (X U ,X V ) could be used to account for TV vari- 
ances, but since all TVs have the same level of smoothness, i.e., comparable 
normalized residuals [TT] §17.2], [T3], they were assigned equal weights. 

A few vectors K and smoothing sweeps v suffice to obtain a good enough c uv estimate, 
which guides a coarsening of the node set by a modest factor of 2-3 (cf. i )3.4.2[) . 

3.3.4. Interpolation Accuracy. Bootstrap AMG pT] §17.2] defines a general- 
caliber P using a least-squares fit to TVs. In our case, 

• \\pXu-Xvf ,„„x 
c uv = mm -2 (3.4) 

p \\Xv\\ 

relates the affinity to the accuracy of the caliber- 1 interpolation formula x v = px u for 
TVs. As a byproduct, we obtain the interpolation coefficient p = (X U ,X V )/(X V ,X V ). 
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Eq. (|3.3p also works for non-zero-sum matrices, such as restricted and normalized 
Laplacians [BT1 §2], where p uv is set to p (see also §i )6.U6.3[) . For the Laplacian, p is 
abandoned in favor of p uv = 1 (cf. M3.1.4)) . 

In the Helmholtz equation, c uv is large for all u,v, indicating that all nodes are 
distant and that no single aggregate set can yield fast AMG convergence (indeed, 
multiple coarse grids are required to restore textbook multigrid efficiency |50j). 

3.4. Aggregation. Aggregation levels refer to the two-level method of 321 with 
a caliber-1 interpolation. P is equivalent to partitioning Af into n c non-overlapping 
aggregates {7u}u<eaT c ', Tu is the set of e„ interpolated from e^, and N c := {1, . . . , n c }. 
Each aggregate consists of a seed node and zero or more associate nodes |49[ Fig. 3.2]. 

This section explains the technical details of aggregate selection, and may be of 
interest to multigrid experts. Other readers may wish to skip it and assess the quality 
of our aggregation decisions via the numerical experiments in SJSJ 

3.4.1. Aggregation Rules. Intuitively, nodes should be aggregated together if 
their values are "close"; ideal aggregates have strong internal affinities and weaker 
external affinities. To this end, we formulated five rules: 

1. Each node can be associated with one seed. 

2. A seed cannot be associated. 

3. Aggregate together nodes with smaller affinities before larger affinities. 

4. Favor aggregates with small energy ratios ( £13.4.41) . 

5. A hub node should be a seed. 

Rules 1 and 2 prevent an associate from being transitively with a distant seed. Oth- 
erwise, long chains might be aggregated together, creating aggregates with weak in- 
ternal connections and very large energy ratios. Rule 3 favors strongly-connected 
aggregates. Rule 4 has a dual purpose: (a) Ultimately, the energy ratio determines 
the AMG asymptotic convergence factor; hence, this rule ensures good convergence, 
(b) Affinities are based on local information (relaxed TVs); their quantitative value 
becomes fuzzier as nodes grow apart [55J §5] . Since small energy ratios usually dictate 
small aggregates, affinities are thus used only for local aggregation decisions. Rule 5 
avoids costly aggregation decisions due to traversing the large neighbor sets of a hubs, 
which would increase the computational work (cf. §3.4.4j) . On the other hand, ag- 
gressively coarsening a large clique into a single node is desirable, as all its nodes are 
strongly correlated. Thus hubs are defined as locally-high degree nodes0 
A typical coarsening ratio in our algorithm ranges between .3-. 5. 

3.4.2. Aggregation Algorithm. The algorithm requires the cycle index 7 as 
an input. Each node is marked as a seed, associate or undecided. First, hubs are 
identified and marked as seeds. Second, edges with very small \w uv \ are discarded 
during aggregation. Since relaxation converges fast at the nodes that become discon- 
nected, they need not be coarsened at all; however, to keep the coarse-level matrix a 
Laplacian, we aggregate all of them into a single (dummy) aggregate. All other nodes 
are marked as undecided. 

Aggregation is performed in r stages: aggregate sets Si,...,S r are generated such 
that each Si-aggregate is contained in some S^+i-aggregate. The set whose coarsening 
ratio a := \Si\/n is closest to a max :— .7/7 is selected as the final set, so that the total 
cycle work would be bounded by ~ 1 + 7a m ax + (7«max) 2 + • • • ~ finest-level units, 



1 In our code, a hub is a node whose degree is significantly larger than a weighted mean of its 
neighbors' degrees [34]: \S U \ > 8J2 v( =e u \w U v\\£v\/ J2 V( =£ U \uiuv\- 
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had the same fill-in occurred at all levels (this is a practical guideline for selecting a 
good coarse set; there exist many sets that yield similar cycle complexities). In our 
code, at most two stages are performed per coarsening level in the cycle. 

In each stage, we scan undecided nodes u and decide whether to aggregate each 
one with a neighbor s that is either an existing seed, or an undecided node that 
thereby becomes a new seed, s is the non-associate neighbor of u with the smallest 
affinity c MS . At the end of the last stage, still- undecided nodes are converted to seeds. 
The complete algorithm is described in the ArXiV e-print [491 §3.4.3]. 

3.4.3. Energy Inflation. The Galerkin coarse-level correction e c (Eq. (|2.2jl ) is 

the best approximation to a smooth error e in the energy norm. Braess [8] noted 
that this does not guarantee a good approximation in the li norm. For example, if 
e is a piecewise linear function in a path graph (1-D grid with w = 1) coarsened by 
aggregates of size two, Pe c is constant on each aggregate and matches e's slope across 
aggregates, resulting in about half the fine- level magnitude. See Fig. 13.2b . 



123456789 10 123456789 10 

(a) (b) 

Fig. 3.2. The Galerkin correction Pe c to a piecewise linear error e in a path graph for (a) 
uniform 1:2 aggregation and (b) variable aggregate size. The meshsize h > is arbitrary. 

An equivalent and more useful observation is that the energy of PTe is twice 
larger than e's, where Te is some coarse representation of e, say, 

(Te), := y^r e « > UeM c . (3.5) 

T is called the aggregate type operator [T6l §2]. (|2 .3[) can be rewritten as 

min (i(y c ) T A c y c -y c P T rl , r:=Ae. (3.6) 



r 



2 



For an ideal interpolation P that satisfies PTe — e, (|3.6[) is minimized by y c = Te. 
A caliber-1 interpolation P still satisfies PTe w e, but the first term in (|3 . 6[) is 
multiplied by the energy inflation factor 

Now (|3.6|) is minimized by y c k q~ 1 Te. As e is not significantly changed by relax- 
ation, its two-level Asymptotic Convergence Factor (ACF) will be p w 1 — 1/q. In 
Fig. EJa, q w 2 and p w .5. 

Several inflation remedies can be pursued: 
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(A) Increase the P's caliber and accuracy. This leads to fill-in troubles fi j3.1.2[) . 

(B) Accept an inferior two- level ACF of 1 — 1/q and increase the cycle index 7 to 
maintain it in a multilevel cycle. Unfortunately, not only does this increase 
complexity, the examples in i )3.4.4l demonstrate that q can be arbitrarily large. 
This ACF cannot be improved by additional smoothing steps either, because 
it is governed by smooth mode convergence. 

(C) Correct the coarse level operator A c to match the fine level operator's energy 
during the setup phase. This option is considered in i j3.4.41 

(D) Modify the coarse level correction Pe c to match the fine level error e during 
the solve phase. This option is pursued in £13.4.51 

3.4.4. Flat Energy Correction. In this scheme, (|2.3p is modified to 

P T AP e c = ^P T (b - Ax) . (3.8) 

The key question is how to choose /j,. Motivated by Fig. 13.2b , and its two-dimensional 
analogue, Braess used fi = 1.8, but his V-cycle convergence for 2-D grid graphs was 
mesh-independent only if a fixed number of levels were used per cycle, and if AMG 
was used as a preconditioner. In fact, no predetermined global factor exists that fits 
all error corrections in scenarios such as Fig. 13.2b . because the coarse-level solution 
depends on all local inflation ratios, which vary among graph nodes. 

On the other hand, a local energy correction factor \x does exist. Indeed, the 
fine-level and coarse-level quadratic energies are separable to nodal energies: 

E(x) = ^ E u (x.) , E u (x) := -- ^ «™ ( x u ~ x v ) 2 , (3.9a) 

£ c (x c ) = J2 E u^°) > E u^ C ) ■= -\ E a uv ( x u - x v? ■ (3.9b) 
ueAf" v-.v^u 

Here E u (x) is the nodal energy at node u, and i?j)(x c ) is the nodal energy at aggregate 
U. The local inflation factor at aggregate U is defined by 

In principle, a local /xjj can be designed using our TVs to at least partially offset qu; 
unfortunately, new difficulties arise (cf. ij6.2[) . Thus we chose to still scale the right- 
hand side by a global [i, but modify the aggregation so that (?[/(x) J$ Q for all smooth 
vectors x and all U £ J\f c , where Q > 1 is a parameter. Under this condition a global 
factor is effective, whose optimal value minimizes the overall convergence factor: 



A*opt = argmm max 

ju>0 !<<?<Q 



argmin max ^ 1 1 — n\ 

Ai>0 



1 - 



2Q (3.11) 



1 



In LAMG, we shoot for Q = 2; hence /x op t — |, and the expected ACF of smooth 
errors is Q/(Q + 1) = |. 

The worst energy ratio Q := max x qu(x) varies considerably with aggregate size, 
shape and alignment. Fig. 13.31 depicts four constellations that may arise in an un- 
weighted 2-D grid graph. (While these examples do not represent every scenario that 
can occur in graphs, they provide necessary conditions under which the algorithm 
must work.) Limiting the aggregate size to 2, for instance, would not prevent case 



12 



O. E. LIVNE AND A. BRANDT 



(d), whose d-dimensional analogue yields an unbounded Q = d + 1. Fortuitously, 
we already possess the tool to signal and avoid bad aggregates: test vectors. The 
algorithm of ^3.4. 21 is modified so that u is only aggregated with a seed s if the local 
energy ratios of all test vectors are sufficiently small. 







4. 


l n 


> — ( 








k i 






r — 






< 




► — < 




1.. ll \r 


p— OH( 


T 


















(a 


, — w 

)Q = 


2 








(b 


Q = 


= 3 
















































































(c 


)Q = 


2 








(d 




= 3 







Fig. 3.3. Coarsening patterns, (a) 1:2 semi-coarsening. The energy ratio of aggregate {5,6} de- 
pends on {%u}}P = i- (b) 1:3 semi-coarsening, (c) 1:2 full coarsening, (d) Staggered semi-coarsening. 



Specifically, we compare the nodal energy E u before and after aggregation for 
each TV. Note that the nodal energy (|3.9a|) is a quadratic in x u and {x v } v ^s lt - Define 

E u (x;y) := ^a uu y 2 - B u (x)y + C„(x) , B u (x) := w uv x v , C„(x) := i ^ u? ut) x^ , 

ve£ u ve£ u 

so E u (x) = E u (x;x u ). The energy inflation that would occur upon aggregating u 
with a seed s is estimated by 

min E u (x^;?/) 
q us := max — - — -, sX . (3.12) 

The numerator is the local energy after a temporary relaxation step is performed at 
u (since the coarse-level correction is executed on a relaxed iterate during the cycle, 
this is the energy it aims to approximate; the papers [18] use a similar idea). The 
denominator is the energy obtained when x u k ^ is set to x s k \ simulating the caliber- 1 
aggregation; more accurate coarse-level energy estimates could be used, but we have 
not pursued them in the lean spirit of LAMG. We aggregate u with the seed s whose 
c us is minimal of all seeds t with q ut < 2.5; if none exist, u is not aggregated at all. 
(Ratios slightly greater than the target Q — 2 are accepted because TVs also contain 
high-energy modes, for which strict ratios are neither attainable nor necessary.) The 
complexity of the aggregation decision is 0(K\£ U \) [49j §3.5.4]. 

Low-degree elimination ( ^3.2|) is advantageous because (a) it largely prevents 
worst case 1-D scenarios such as Fig. 13.2b . where it is impossible to obtain low energy 
ratios without excessively increasing the coarsening ratio; (b) it increases the number 
of neighbors of u and the chance of locating a seed s with small energy inflation. 
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3.4.5. Iterate Recombination. Instead of fixing /i by (|3.8I) . an effectively- 
adaptive energy correction is obtained by modifying the correction to smooth errors 
during the solution cycle. Let I be any level such that I + 1 is an Aggregation level. 
When the cycle switches from level I — 1 to level I, $ sub-cycles are applied to A ( x' = 
h l , where $ is 1 or 2. We save the iterates x' obtained after the pre- relaxation of 
sub-cycle i, and, before switching back to level I — 1, replace the final iterate x' by 

y' = x' + £*! (x{ - x l )+--- + a# (x l d - x l ) , (3.13) 

where {cti}f =1 are chosen so that ||b' — A'y'||2 is minimized (this is an ni x ■& least- 
squares problem solved in 0(ni) time). This iterate recombination |651 §7.8.2] dimin- 
ishes smooth errors x' — x' that were not eliminated by (I + l) th -level corrections. 
Since the initial residuals obtained after interpolation from level / + 1 are not smooth, 
the residual minimization is only effective after x\ — x l is smoothed. To maximize iter- 
ate smoothness, we perform more post- than pre-smoothing relaxations. The optimal 
splitting turned out to be a (l,2)-cycle; cf. gj 

This acceleration is superior to CG because it is performed at all levels. Iterate 
recombination at coarse levels has been long recognized as an effective tool in the 
multigrid literature |65[ Remark 7.8.5]. In LAMG, recombination occurs more fre- 
quently at coarser levels because 7 > 1. Notay's K-cycle [52j [53] employs a similar 
multilevel CG acceleration, however with a much larger cycle index (up to 7 = 4), 
which increases the solver's complexity. 

The aggregation is still modified here as in §3.4.41 to ensure small energy ratios 
and maximum reduction in the residual norm after recombination. 

4. The LAMG Algorithm. 

4.1. Setup Phase. The setup flow is depicted in Fig. 14.11 Its sole input is the 
cycle index 7 > 1 to be employed at most levels of subsequent solution cycles. In our 
program, 7 = 1.5; this choice is discussed in §4.21 The original problem (I = 1) is 
repeatedly coarsened by either elimination or caliber- 1 aggregation until the number 
of nodes drops below 150, or until relaxation converges rapidly. We employ K = 4 
TVs at the finest level, and increase K up to 10 at coarser levels. Each TV is smoothed 
by v — 3 relaxation sweeps. 




Fig. 4.1. LAMG setup phase flowchart. 

4.2. Solve Phase. The solve phase consists of multigrid cycles [T4J §1.4]. Each 

I < L is assigned a cycle index 7' and pre- and post-relaxation sweep numbers v\,v\. 
If level I + 1 is the result of elimination, 7' = 1 and v\=v\= 0; otherwise, 

I . • > ' i 1 1 o (a -n 

7 = \ r , , , ,1 1/1=1,1/9=2. 4.1 

V ■-.{2,.7\£ l+l \/\£ l \} , otherwise, 1 ' 2 V ' 
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At fine levels, 7 = 1.5 is employed. This value is theoretically marginal for attaining 
a bounded multilevel ACF if the smoothest error two- level ACF is ~ | [HI §6.2], as 
implied by p. Ill) for Q — 2. Notwithstanding, worst-case energy ratios occur infre- 
quently, and in practice a smaller ACF is obtained. This issue is further diminished 
by the adaptive energy correction. At coarse levels, 7' is increased to maximize error 
reduction while incurring a bounded work increase. 

Three relaxation sweeps per level provide adequate smoothing, especially in light 
of the coarse-level correction's crudeness. The coarsest problem is solved by relaxation 
(if it is fast) or a direct solver on an augmented system [49J §3.6.3]. Finally, (|1.3bj) is 
enforced by subtracting the mean of x from x at the end of the cycle. 



Level 



Finest 



V 



Aggregation 



Elimination 



Aggregation 



Aggregation 



2 



z 







c 



c 



12 



7 



2 **Ri 




C 



Fig. 4.2. A four-level cycle. A boxed number denotes a number of relaxations. C: the coarsest- 
level solver. M: subtracting the iterate mean. Down-arrows: right-hand side coarsening \3.2\l 
or \3. 8\ ). Up-arrows: coarse-level corrections \4°\ Eg. (3. 4a)] or \2.4f - R$-' a (i? + l)-iterate 
recombination i3.13t . Iterates are saved at the black dots before coarsening. 



All cycle parameters are fixed: no fine tuning or parameter optimization is re- 
quired for specific graphs. The total cycle work is equivalent to about 10 relaxations. 

5. Numerical Results. We provide supporting evidence for LAMG's practical 
efficiency for a wide range of graphs. 

5.1. Smorgasbord. An object-oriented Matlab 7.13 (R2011b) serial LAMG 
implementation was developed and is freely available online |48j . The time- intensive 
functions were implemented in C and ported with the mex compiler [24] . It was 
tested on a diverse set of 3774 real-world graphs with up to 47 million edges, collected 
from The University of Florida Sparse Matrix collection (UF) [25], C. Walshaw's 
graph partitioning archive [BBJ, I. Safro's MLogA results archive at Argonne National 
Laboratory [57], and the FTP site of the DIMACS Implementation Challenges [2"B] . 

Graphs originated from a plethora of applications: airplane and car finite-element 
meshes; RF electrical circuits; combinatorial optimization; model reduction bench- 
marks; social networks; and web and biological networks. If the graph was di- 
rected, it was converted to undirected by summing the weights of both directions 
between each two nodes. Then, if it contained a large negative edge weight with 
w U v < — 10 _5 S«'e£ l^f'l: au weights were made positive by taking their absolute 
values. Finally, the Laplacian matrix A was formed and used. 

Runs were performed on Beagle, a 150 teraflops, 18,000-core Cray XE6 supercom- 
puter at The University of Chicago (we only took advantage of parallelism by dividing 
the collection into equal parts, each of which ran a single AMD node with 2.2 GHz 
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CPU and 32GB RAM). For each graph, a zero-sum random b was generated. LAMG 
setup, followed by a linear solve that started from a random guess and proceeded until 
the residual ^-norm was reduced by 10 10 . Six performance measures were computed: 

• Setup time per edge i se tup- 

• Solve time per edge per significant figure t so i vc . If the residual norm after i 
iterations was rt and p iterations were executed, t so \ ve := i/(mlog 10 (ro/r p )), 
where t was the solve time. 

• Total time per edge ttotai — ^sctup + 10i S oive to solve Ax = b to 10 significant 
figures for a single b. 

• Storage per edge. 

• Asymptotic convergence factor, estimated by (r p /ro) 1 / p . 

• Percentage spent on setup i S etup/itotai- 

LAMG scaled linearly with graph size: both t se t U p an d ^soive were approximately 
constant (Fig. 15.1b : Table 15. 1[) . Times were measured in terms of the most basic 
sparse matrix operation: a matrix- vector multiplication (MVM), because even MVM 
time scaled slightly superlinearly for m > 5 x 10 6 due to loss of memory locality in 
the MATLAB compressed-column format H. In wall clock time, the total time per 
edge was 5.6 x 10 -6 on average, i.e., LAMG performed a linear solve to 10 significant 
figures at 178, 000 edges per second. The LAMG hierarchy required the equivalent of 
storing w 4m edges in memory (Fig. 15.1b ) . Adaptive energy correction provided a 
20% speed up over a flat /U = § and was thus employed in all reported experiments. 
The ACF was better than the expected .33 for flat correction ( §3.4.4j) . 

We compared LAMG with MATLAB's direct solver (the \ operator). Since the 
direct solver ran out of memory for many graphs with over 10 5 edges, we did not 
include it in the plots. LAMG aims at robustness for a wide variety of graphs, and 
should be compared against solvers that do not often break down or require tuning, 
even if they are faster for a subset of the graphs (in analogy, many graphs could be 
solved much faster with a tailored geometric multigrid or classical AMG algorithm). 

An advantage of iterative solvers over direct is their tunable solution accuracy e. 
Since A's entries often incur measurement or modeling errors in applications, it does 
not make sense to solve Ax = b to more than 2-3 significant figures. Furthermore, 
a single multigrid cycle is typically sufficient to solve a nonlinear problems as well as 
time-dependent problems to the level of discretization errors [T4j Chaps. 7,15]. 

We also compared LAMG against a Matlab implementation of CMG, a hybrid 
graph-theoretic- AMG solver [42]. Since CMG doesn't run yet on the Cray architec- 
ture, experiments were performed on a smaller 64-bit Dell Inspiron 580 (3.2 GHz CPU; 
8GB RAM) for 2668 graphs with up to 10 7 edges. Both algorithms successfully solved 
all graphs. Solve times were similar, while LAMG's setup time was thrice larger than 
CMG's. On the other hand, LAMG was much more robust than CMG: it had only 
3 outliers whose solve time was large, as opposed to 26 CMG outliers whose relative 
magnitude was much larger (4 in setup and 22 in solve; see Fig. I5.1b -d and Table l5~2|) . 

5.1.1. LAMG's Outliers. The three solve-time outliers (Table [5~2|) were char- 
acterized by a large portion of small edge weights, which were carried over to all 
coarse matrices and increased coarsening ratios. Since LAMG's work was controlled 
by decreasing 7 via (|4.1|) . a slower cycle resulted. We plan to improve those cases in 
the future by appropriately ignoring weak edges at each level; cf. - 11 



2 T. Davis, private communication. 
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Measure 


LAMG (Beagle) 


LAMG (Dell) 


CMG (Dell) 


Median 


Mean ± Std. 


Median 


Mean ± Std. 


Median 


Mean ± Std. 


^total 


482.9 


585.4 ±406 


558.1 


680.5 ±497 


342.2 


582.6 ± 1088 


^sctup 


199.6 


222.5 ± 108 


234.9 


248.4 ± 113 


66.3 


89.3 ± 111 


^solvc 


27.3 


36.3 ±33 


31.6 


43.2 ±42.3 


25.5 


47.9 ± 106 


ACF 


.107 


.128 ±.12 


.112 


.132 ± .11 


.500 


.495 ±.21 


%Setup 


43.8% 


43.7% ± 13% 


42.4% 


43.4% ± 13% 


21.5% 


22.9% ± 12% 



Table 5.1 

Left column: median and mean LAMG performance on the Beagle Cray for 892 graphs with 
50,000 < m < 4.7 X 10 7 . Middle and right: LAMG vs. CMG performance on a Dell Inspiron for 
794 graphs with 50, 000 < m < 10 7 . Times are measured in matrix-vector multiplications. 



Name 


n 


m 


LAMG (Dell) 


CMG (Dell) 


ACF 


^sctup 


^solvc 


^sctup 


^solvc 


Ill-conditioned Stokes 


20896 


87010 


.66 


420 


323 


66 


25 


Large basis 


440020 


2560040 


.88 


322 


502 


70 


107 


RF circuit simulation 


4690002 


6251251 


.72 


312 


525 


65 


304 


Law citation network 


925340 


6675561 


.24 


169 


15 


152 


2037 


Berkeley-Stanford web 


512501 


3480880 


.17 


168 


18 


1585 


126 


Molecule pscudopotcntial 


268096 


8833823 


.13 


167 


21 


1879 


41 



Table 5.2 

Top section: the three LAMG outliers. Bottom section: three of CMC's 26 outliers. Times are 
measured in matrix-vector multiplications. 
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G 3 G 4 

Fig. 5.3. The four finest levels for the UF Harvard 500 non-planar web graph "Slf. 



5.2. Grids with Negative Weights. Unlike CMG, LAMG is not restricted to 
diagonally-dominant systems, and can also be applied to some graphs with negative 
edge weights w uv , as long as the Laplacian matrix is (or is very close to being) positive 
semi-definite. To demonstrate this capability, we tested LAMG on the following SPS 
2-D grid Laplacians, whose stencils are depicted in Fig. 15.41 
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(a) The standard 5-point finite-difference discretization of U xx + U yy on the unit 
square with Neumann boundary conditions. 

(b) The 13-point 4 th -order finite-difference stencil of U xx + U yy . 

(c) The discretized anisotropic-rotated Laplace operator 

(cos 2 a + e sin 2 a) U xx + (1 — e) sin(2a)U xy + (e cos 2 a + sin 2 a) U yy , (5.1) 

ch a = -7r/4, e = 10~ 2 , 
alignment-agnostic cross-term 



with a — — 7r/4, e — 10 2 , standard 5-point stencil of U xx ,U yy , and an 



U 



■HI 



1 

4h? 



where h is the grid meshsize. Neumann boundary conditions were used, 
(d) The same as (c), but aligning U xy with the northeast and southwest neighbors: 



U X y 



1 



-1 

4 
-1 



(a) 

.12375 -.25250 
.25250 1.5050 
.12375 -.25250 
(c) 



.12375 
-.25250 
-.12375 



1 -16 



-.5000 
.24750 



-1 
-16 

60 
-16 
1 

(b) 
-0.5000 
1.5050 
-0.5000 

(d) 



■16 1 



.24750 
-0.5000 



Fig. 5.4. Stencils of grid Laplacians with negative weights, 
meshsize-independent sum and rounded to five significant figures. 



Entries are normalized to a 



Problems (c) and (d) are bad discretizations that do not align with the characteristic 
direction of (|5.1I) , and are considered hard for AMG [T^] . Performance figures for the 
Dell Inspiron are given in Table [ 



Problem 


m 


L 


ACF 


%Setup 


'total 


(a) 5-point 


2095104 


19 


0.216 


29% 


902 


(b) 13-point 4 th order 


4188160 


20 


0.262 


22% 


1355 


(c) Anis. rot. agnostic 


4188162 


19 


0.816 


5% 


5453 


(d) Anis. rot. misaligned 


3141633 


20 


0.870 


4% 


8136 



Table 5.3 

LAMG performance for grid graphs on a 1024 X 1024 grid with n ■■ 



1048576 nodes. 



LAMG exhibited mesh-independent convergence and run time in all cases and 
scaled linearly with grid size, albeit its convergence was much slower for cases (c) and 
(d), whose negative edge weights are more significant. Compared with the Bootstrap 
AMG method 12 , which focused on accurately finding the characteristic directions 
without sparing setup costs and only presented two-level experiments, LAMG is a 
full multi-level method with a far shorter setup time, although its ACF could also be 
significantly reduced using bootstrap tools. These results are certainly preliminary. 
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5.3. Lean Geometric Multigrid. Higher performance for the Poisson equation 
discretized on a uniform grid can be obtained by a standard 1:2 coarsening in every 
dimension at all levels and employing Gauss-Seidel relaxation in red-black ordering 
[141 §3.6]. LAMG reduces to Lean Geometric Multigrid: standard multigrid cycle with 
index 7 = 1.5, first-order transfers and energy-corrected coarsening. Since the energy 
ratio is 2 for all error modes, a flat correction ^ = 2 is employed in (j3,8jl . 

For the 2-D periodic Poisson problem, this cycle turns out to be a record-breaking 
Poisson solver in terms of asymptotic efficiency: it achieves a convergence factor of .5 
per unit work, versus .67 for the classical multigrid V(l,l) cycle with linear interpola- 
tion and second-order full weighting [491 §4.1]. For other boundary conditions, finding 
the right /1 is not as easy; while supplementary local relaxations near boundaries the- 
oretically ensure attaining the two-level rates [21 §5], it would be more beneficial to 
study the performance of adaptive energy correction in geometric LAMG. 

6. Future Research. Enhancements and adaptations of the LAMG approach 
to related computational problems are outlined below. 

6.1. Coarsening Improvements. The LAMG algorithm of §3] is by no means 
final and may be improved in various ways. 

• The average setup time could be reduced by employing classical AMG with 
no test vectors, and switching to the LAMG strategy only when the former 
fails. 

• In graphs with many weak edges (such as the outliers in Table. I5T21 . efficiency 
may be increased by temporarily ignoring them in the Galerkin operator 
computation, yet keeping track of their total contribution to each aggregate's 
stencil. If a level is reached at which this total is no longer small compared 
with the aggregate's other edge weights, it is reactivated. 

• Currently, a node u can only be aggregated with a direct neighbor s. In 
some problems, u's second-degree neighbors should also be searched to en- 
sure a good aggregation. For instance, in the anisotropic-rotated problem 
Fig. 15. 4H . u should be aggregated along the characteristic direction, i.e., with 
its southeast or northwest neighbor, neither of which is contained in £ u . 

• If no small energy ratio can be found, or if subsequent cycle convergence is 
slow, isolated bottleneck nodes can be de-aggregated. Alternatively, one can 
up the interpolation caliber at these troublesome nodes, provided that this 
does not substantially increase the total coarse edges. 

• Adaptive local relaxation sweeps may improve efficiency in various problems 
such as PDEs with structural singularities [2]. 

Additionally, user-defined parameters could be supplied to treat special graph families 
more efficiently. For instance, if node coordinates are available, they can be used to 
generate smoother initial TVs than the default random initial guess. Optimizing the 
coarsening is most advantageous when Ax = b is solved for multiple b's, since a 
larger setup cost is tolerable. Such is the case in time-dependent problems [3"T] . 

6.2. Local Energy Correction. Instead of a flat /1 = | factor in (I3.8[) . one 

can apply different [i's to different aggregates. We experimented with different energy 
correction schemes, some based on fitting the coarse nodal energies of TVs to their 
fine counterparts (cf. (I3.9L>[) V While this can dramatically curtail energy inflation, 
care must be taken to avert over-fitting that ultimately results in the coarse-level 
correction operator's instability. 

Analogously, one can define a local adaptive /i in the iterate recombination ( ^3.4. 5p 
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at each level I, provided that it is properly smoothed [H, §5.3]. It is unclear whether 
the extra work would be justified in either case. It may turn out useful for problems 
re-solved for many b vectors, or when a larger setup overhead is tolerable. 

6.3. Other Linear Systems. The LAMG caliber-1 aggregation can be applied 
to non-zero row sum matrices, except that the interpolation weights are no longer 1. 
The affinity definition (|3.3I) remains intact, while the corresponding P entry is set 
to p uv := (X U ,X V )/(X V ,X V ) (see (13.41) ). Normally, relaxed TVs yield an accurate 
enough p uv ; in problems with almost-zero modes, e.g., the QCD gauge Laplacian, 
TVs may need to be improved by a bootstrap cycle |13j . 

Further research should be conducted for negative- weight graphs such as the high- 
order finite element and anisotropic grid graphs of £15.21 The reported convergence 
factors can be improved by producing bootstrapped TVs via applying multilevel cycles 
to Ax = 0. The cycle is far more powerful than plain relaxation in damping smooth 
characteristic components, which should lead to more meaningful algebraic distances 
and to correct anisotropic coarsening in a second setup round (much larger spacing 
in the characteristic direction and no coarsening in the cross-characteristic direction). 
The bootstrap procedure should be useful in many other graphs. 

6.4. LAMG Eigensolver. The LAMG hierarchy can be combined with the 
Full Approximation Scheme (FAS) [2j Chap. 8] to find the K lowest eigenpairs of 
A, similarly to the work [44]. We perform the variable substitution x c = e c + Rx, 
transforming the coarse equation (|2.3p into 

A c x c = P T b -I- rx , r:=A c R-P T A, (6.1) 

followed by the fine-level correction x ^— x + P(x c — Rx). The elimination and 
aggregation are both special cases of (|6.1j) . with Rx := xc and R = 0, respectively. 
The analogue of (|2.3j) for coarsening (A — Afcl)xfc = is 

(A c - A fc B c ) x£ = T fc x , r k = (A c - A fc B c ) R - P T (A - A fc B c ) . (6.2) 

Thus a separate affine term appears in the coarse equation of each approximate eigen- 
vector x£, k = 1, . . . ,K. In particular, the elimination of WS. 21 becomes approximate, 
yet (I6.2[) remains linear in as opposed to the exact non-linear Schur complement 
formed by the AMLS method [3J. Gauss-Seidel may be replaced by Kaczmarz relax- 
ation at very coarse levels to prevent the divergence of smooth error modes [44] . 

Alternatively, one can incorporate the LAMG linear solver into a Rayleigh quo- 
tient iteration [3SJ §8.2], [59] . However, FAS is attractive because it also applies to 
general nonlinear problems [14( §8], e.g., quadratic and linear programming. 

7. Conclusion. Laplacian matrices underlie a plethora of graph computational 
applications ranging from genetic data clustering to social networks to fluid dynam- 
ics. To the best of our knowledge, the presented algorithm, Lean Algebraic Multigrid 
(LAMG), is the first graph Laplacian linear solver whose empirical performance ap- 
proaches linear scaling for a wide variety of real-world graphs. Combinatorial Multi- 
grid was also quite successful, performing faster on average, yet with many more 
outliers. The LAMG approach can also be generalized to non-diagonally-dominant, 
eigenvalue and nonlinear problems. 
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